1 Executive Summary

1.1 Introduction

This coursework focuses on applying a multivariate linear regression model to predict the total cost for two travelers staying for four nights in an Airbnb in Vienna, Austria. The data set is quite large and raw (non-numerical values, variables with several NA values etc.), so this is as much a data management project as it is a modelling assignment. In the exploratory analysis section, we first start by analysing the data to identify potential deterministic relationships. We also make sure to tidy up the data by transforming non-numerical variables and by tailoring certain variables to optimise a final model. This analysis is aided by some visualisations, which aid our interpretation of the data and variable selection. The model fitting consists of applying a forward stepwise regression i.e. we test if the addition of a variable improves our model’s explanatory ability iteratively whilst controlling for/partialling out the effect of other variables. We apply a significance level of 5% throughout the analysis. We perform a log-linear regression which means that we transform the target variable (price for two people to stay in Vienna for four nights) to its natural log. This means we can interpret the coefficients on the explanatory variables as approximate expected percentage changes in price for four nights. This methodology has been well-researched, so our contribution is strictly empirical.

1.2 Results

Our optimal model reads as follows: \(Y = 3.64 - 0.44(Private\ Room) - 0.57(Shared\ Room) - 0.17(T5) - 0.13(T4) + 0.05(T2) + 0.38(Innere\ Stadt) + 0.29(Rating) + 0.09(Accommodates) + 0.14(Bathrooms) + 0.06(Superhost) + 0.01(Availability\_30) + \epsilon_i\) - where Y = ln(price_4_nights).

This analysis is very approachable for the average person given the obvious logic behind each of the variables. We can interpret the following factors which influence the average expected price to stay in Vienna for four nights as follows:

  • Private Room: the negative coefficient shows that you should expect to pay less for a private room in a shared accommodation as opposed to a hotel room (which is the baseline). This signifies a negative premium (c. -44%) for having to share the other facilities in an accommodation. This is a dummy variable, i.e. is either one or zero if it is a private room or not.
  • Shared Room: this has a more negative coefficient than the Private Room which signifies that people on average should expect to pay less for a shared room than a private room, which makes sense (e.g. hostel vs hotel)
  • The variables T5, T4, T2, and Innere Stadt are location variables which represent pools of Vienna’s 23 neighbourhoods grouped by mean rental price. Innere Stadt is the most expensive area by quite a lot, so is isolated. The higher the number in the other tiers, the less expensive AirBnB rental prices are on average. T3 is the assumed baseline category. It makes sense that neighbourhood should be an important explanatory factor of rental prices, as location is cornerstone of valuation drivers for real estate markets. The neighbourhoods were also better at explaining the variation in the price data than distance from the center of the city, which makes sense as changes in accommodation quality and demand isn’t generally constant the further you go from the centre in unplanned cities
  • Rating: this variable denotes the AirBnB rating for the property out of five on the website. This is an obvious factor affecting price of accommodation as demand for higher rated properties should be higher on average than lower rated properties, as previous experiences there are expected to be better on average. This is especially important when travelers sort the accommodation options on the website in terms of rating which is often the case.
  • Accommodates: this coefficient reads that if an accommodation can fit one more person, on average, the rental price for four nights should increase by 9%. The direction of the variable makes sense; however, we would expect this relationship to be convex in general i.e. the difference between the price of an accommodation for two people versus three should be greater than for eleven versus ten, based on logic.
  • Bathrooms: an increase of one in the number of bathrooms should increase the average rental price for four nights by 14% on average, when controlling for the other variables.
  • Superhost: a Boolean variable which is True if the host is a superhost. The positive relationship here makes sense as you would expect travelers to pay a premium on average to stay with a Super Host. This variable is unique to the AirBnB platform.
  • Availability_30: how many days the room is available in the next 30 days. This variable is an extremely useful gauge of current demand for the property. Logically, we would expect demand to be conditional on price as opposed to the structure above, but there appears to be some simultaneity between the two variables.

1.3 Source the Data

1.3.1 Load the Data

We first load the data and parse the price variable into an integer.

# Price is a character, parse to integer
vienna_listings <- vienna_listings %>% 
  mutate(price = parse_number(price))

# In case of duplicate rows
unique_listings = unique(vienna_listings[,])

1.3.2 Mapping

The following code plots on the map all AirBnBs where minimum_nights is less than equal to four.

leaflet(data = filter(vienna_listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Using this map, we can be redirected to any airbnb we want to check out along with an overview of their location. We know move on to exploratory analysis.

2 Data Exploration and Feature Selection (EDA)

We normally conduct a thorough EDA via three methods:

  • Looking at the raw values.
  • Computing summary statistics of the variables of interest, or finding NAs
  • Creating informative visualizations

However, we analyse the correlation matrices later in the process when we have narrowed down further our “meaningful” variables, i.e. those with a suspected possible significant explanatory ability of the variation of price.

2.1 Variable Summary and Selection

2.1.1 Glimpse Data

# Use Glimpse from the Dplyr package to get an initial look at the raw data
kable(dplyr::glimpse(vienna_listings))

Our data has 11583 rows/observations with 74 variables.

2.1.2 Skim Data

# Generate summary statistics with mosaic and skimr packages
kable(skimr::skim(vienna_listings))

There are 37 independent numeric variables, 9 logical, 5 date and 22 character variables.

2.2 Data Wrangling

2.2.1 Logical Selection of Variables

Since there are 73 independent variables and many of them don’t make any sense (eg. url) for price prediction, we select some variables for simplifying later exploration. According to our hypothesis, the following variables could potentially affect the price.

# Logical screening

vienna_listings_cleaned <- vienna_listings %>% 
  select(
    price,
    host_response_time,
    host_response_rate,
    host_acceptance_rate,
    host_is_superhost,
    host_total_listings_count,
    host_identity_verified,
    neighbourhood_cleansed, #23 areas
    latitude,
    longitude,
    property_type,
    room_type,
    bathrooms_text,
    bedrooms,
    accommodates,
    beds,
    minimum_nights,
    maximum_nights,
    availability_30,
    availability_365,
    instant_bookable,
    number_of_reviews_ltm,
    number_of_reviews,
    review_scores_rating,
    reviews_per_month
  )
#glimpse and skim again
kable(glimpse(vienna_listings_cleaned))
skimr::skim(vienna_listings_cleaned) # kable is too slow with skim

2.2.2 Analysis of Price

Then, we plot the distribution of price. As the distribution is significantly right skewed, we use the log function to aptly transform the price data.

# Visualise the Price
vienna_listings_cleaned %>% ggplot(aes(x=price)) + 
  geom_density()+labs(
    title = "Price PDF")+
  theme_economist()

# Our data is very positively skewed which will make prediction more difficult, so we take the natural log of price to normalise the data
vienna_listings_cleaned <- vienna_listings_cleaned %>% 
  mutate(lnprice = log(price))

# Visualise the Data
vienna_listings_cleaned %>% ggplot(aes(x=lnprice)) + 
  geom_density()+labs(
    title = "Log Price PDF")+
  theme_economist()

confint(vienna_listings_cleaned$price, mean, level = 0.9)
5%95%
20169

As the log price is still a bit right skewed, we then plot the 5% and 95% percentile of the price. This gives us approximate price upper and lower bounds of €20 and €169 respectively.

# Remove positive outliers and non-sensical/irrelevant values left of the mean to build a more meaningful analsysis
vienna_listings_cleaned <- vienna_listings_cleaned %>% 
  filter(between(lnprice, log(20), log(169)), na.RM=TRUE)

# Visualise the Data
vienna_listings_cleaned %>% ggplot(aes(x=lnprice)) + 
  geom_density()+labs(
    title = "Log Price PDF (Tails Removed)")+
  theme_economist()

# save this dataset
vienna_listings_final <- vienna_listings_cleaned

Now the natural log of price is approximatley normally distributed within these bounds; thus, we use lnprice as our dependent variable to construct our model.

2.3 Transform to Numerical: Bathrooms, host_response_rate, host_acceptance_rate

We then change the non-numerical variables such as bathrooms, host_response_rate, and host_acceptance_rate to numerical variables.

vienna_listings_final <- vienna_listings_final%>%
  mutate(bathrooms = parse_number(bathrooms_text),
         host_response_rate = parse_number(host_response_rate),
         host_acceptance_rate = parse_number(host_acceptance_rate))%>%
  select(-bathrooms_text)
#glimpse and skim again
kable(glimpse(vienna_listings_final))
kable(skim(vienna_listings_final))

2.3.1 Characters to Factor Values

# Create factor variables for host response time 
kable(unique(vienna_listings_final$host_response_time)) %>% 
  kable_styling()
x
within an hour
N/A
within a few hours
within a day
a few days or more
NA
invisible(vienna_listings_final %>% 
  mutate(
    # mark "N/A" as missing values
    host_response_time = ifelse(host_response_time =="N/A",NA,host_response_time),
    # relevel
    host_response_time = fct_relevel(host_response_time,
                                            "within an hour", 
                                            "within a few hours",
                                            "within a day",
                                            "a few days or more"
                                            )))
# Create factor variables for room types 
kable(unique(vienna_listings_final$room_type))%>% 
  kable_styling()
x
Hotel room
Entire home/apt
Private room
Shared room
invisible(vienna_listings_final$room_type <- factor(vienna_listings_final$room_type, 
                                          labels = unique(vienna_listings_final$room_type)))

In the original raw data, there are 74 Variables with 11,583 Observations. After filtering the data based on log price (to remove NA price values and outliers), we have 10,602 observations.

After variable selection based on logic, we have 26 variables.

  • The following variables are numbers: Numeric Variables * 19: host_total_listings_count, host_response_rate, host_acceptance_rate, accommodates, longitude, latitude, beds, bedrooms,bathrooms, variables for maximum and minimum nights, number of reviews, availabiity data, review scores data, calculated host data etc.

  • The following are categorical or factor variables: Logical/Boolean Variables * 3 : host_is_superhost, host_identity_verified, instant_bookable Character variables * 4 : host_response_time, neighbourhood_cleansed, property_type, room_type

2.3.2 Property Types

Next, we look at the variable property_type. We use the count function to determine how many categories there are their frequency. The top 4 most common property types are “Entire rental unit”, “Private room in rental unit”, " Entire condominium (condo)“, and”Entire serviced apartment". The four types account for 91.19% proportion of the total listings.

# Identify the amount of each property type
property_type_count <- vienna_listings_final %>%
    count(property_type) %>%
    mutate(percentage = n/sum(n)*100)%>%
    arrange(desc(n))

kable(property_type_count)
property_type n percentage
Entire rental unit 6975 65.7894737
Private room in rental unit 1915 18.0626297
Entire condominium (condo) 485 4.5746086
Entire serviced apartment 293 2.7636295
Entire loft 154 1.4525561
Private room in residential home 103 0.9715148
Private room in condominium (condo) 98 0.9243539
Entire residential home 83 0.7828712
Room in hotel 66 0.6225241
Room in boutique hotel 51 0.4810413
Private room in bed and breakfast 48 0.4527448
Private room in hostel 45 0.4244482
Shared room in rental unit 41 0.3867195
Room in serviced apartment 29 0.2735333
Room in aparthotel 24 0.2263724
Private room in loft 21 0.1980758
Entire guest suite 19 0.1792115
Private room in townhouse 15 0.1414827
Private room in guesthouse 13 0.1226184
Private room in serviced apartment 11 0.1037540
Entire townhouse 10 0.0943218
Shared room in hostel 10 0.0943218
Entire place 9 0.0848896
Entire bungalow 8 0.0754575
Private room in villa 7 0.0660253
Entire villa 6 0.0565931
Entire guesthouse 5 0.0471609
Private room in bungalow 5 0.0471609
Room in bed and breakfast 5 0.0471609
Shared room in bed and breakfast 5 0.0471609
Camper/RV 4 0.0377287
Shared room in condominium (condo) 4 0.0377287
Entire cottage 3 0.0282965
Private room 3 0.0282965
Private room in casa particular 3 0.0282965
Casa particular 2 0.0188644
Private room in castle 2 0.0188644
Private room in guest suite 2 0.0188644
Shared room in residential home 2 0.0188644
Shared room in serviced apartment 2 0.0188644
Tiny house 2 0.0188644
Dome house 1 0.0094322
Entire cabin 1 0.0094322
Entire chalet 1 0.0094322
Entire hostel 1 0.0094322
Lighthouse 1 0.0094322
Private room in barn 1 0.0094322
Private room in bus 1 0.0094322
Private room in camper/rv 1 0.0094322
Private room in cave 1 0.0094322
Private room in earth house 1 0.0094322
Private room in farm stay 1 0.0094322
Private room in nature lodge 1 0.0094322
Private room in treehouse 1 0.0094322
Shared room in loft 1 0.0094322

Top 4:

  • Entire rental unit (65.78%)
  • Private room in rental unit (18.06%)
  • Entire condominium (condo) (4.57%)
  • Entire serviced apartment (2.76)

Since the vast majority of the observations in the data are one of the top two property types (84%), we would like to combine Entire condominium (condo), Entire serviced apartment and Entire loft as a type called Entire apartment. This method was better suited to the Vienna dataset, we require sufficient datapoints for each category in order to reduce our model’s standard errors.

We then create a simplified version of property_type variable that has 4 categories: the top two categories, Entire apartment and Other.

# First we need to summarize the other values in the Category "Others"
vienna_listings_final <- vienna_listings_final %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit") ~ property_type, 
    property_type %in% c( "Entire condominium (condo)", "Entire serviced apartment","Entire loft") ~ "Entire apartment",
    TRUE ~ "Other"
  ))
# Check if done correctly
kable(vienna_listings_final %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n)))
property_type prop_type_simplified n
Entire rental unit Entire rental unit 6975
Private room in rental unit Private room in rental unit 1915
Entire condominium (condo) Entire apartment 485
Entire serviced apartment Entire apartment 293
Entire loft Entire apartment 154
Private room in residential home Other 103
Private room in condominium (condo) Other 98
Entire residential home Other 83
Room in hotel Other 66
Room in boutique hotel Other 51
Private room in bed and breakfast Other 48
Private room in hostel Other 45
Shared room in rental unit Other 41
Room in serviced apartment Other 29
Room in aparthotel Other 24
Private room in loft Other 21
Entire guest suite Other 19
Private room in townhouse Other 15
Private room in guesthouse Other 13
Private room in serviced apartment Other 11
Entire townhouse Other 10
Shared room in hostel Other 10
Entire place Other 9
Entire bungalow Other 8
Private room in villa Other 7
Entire villa Other 6
Entire guesthouse Other 5
Private room in bungalow Other 5
Room in bed and breakfast Other 5
Shared room in bed and breakfast Other 5
Camper/RV Other 4
Shared room in condominium (condo) Other 4
Entire cottage Other 3
Private room Other 3
Private room in casa particular Other 3
Casa particular Other 2
Private room in castle Other 2
Private room in guest suite Other 2
Shared room in residential home Other 2
Shared room in serviced apartment Other 2
Tiny house Other 2
Dome house Other 1
Entire cabin Other 1
Entire chalet Other 1
Entire hostel Other 1
Lighthouse Other 1
Private room in barn Other 1
Private room in bus Other 1
Private room in camper/rv Other 1
Private room in cave Other 1
Private room in earth house Other 1
Private room in farm stay Other 1
Private room in nature lodge Other 1
Private room in treehouse Other 1
Shared room in loft Other 1

Next we can make a factor out of prop_type_simplified.

vienna_listings_final <- vienna_listings_final %>% 
  mutate(
     prop_type_simplified = fct_relevel(prop_type_simplified,
                                 "Entire rental unit",
                                 "Private room in rental unit",
                                 "Entire apartment",
                                 "Other")) %>%
  select(-property_type)

2.3.3 Analysis of Minimum Nights

We only want to include listings in our regression analysis that are intended for travel purposes, as this is the target user for our model.

# right skewed
vienna_listings_final %>% 
  ggplot(aes(x=minimum_nights))+
  geom_histogram() +
  theme_economist() +
  labs(title="Minimum Nights Distribution")

vienna_listings_final %>% 
  filter(minimum_nights<=10) %>%
  mutate(minimum_nights=as.integer(minimum_nights)) %>%
  ggplot(aes(x=minimum_nights))+
  geom_bar()+
  scale_x_continuous(breaks = seq(0,10,1)) +
  theme_economist() +
  labs(title="Minimum Nights Distribution (Outliers Removed)")

  • The most common values for the variable minimum_nights is 1,2,3.
  • 1 minimum night stands out among the common values.
  • The seemingly unusual large values for minimum_nights probably represent listings for long-term tenants, not for travelers.

Filter the airbnb data so that it only includes observations with minimum_nights <= 4

vienna_listings_final <- vienna_listings_final %>% 
  filter(minimum_nights<=4) %>% 
  filter(maximum_nights>=4) # Since we will be modelling the price to stay for four nights, maximum stay must be greater than or equal to 4
invisible(vienna_listings_final)

We now have only 8,844 listings.

2.3.4 Analysis of Location: Neighbourhood, Longitude and Latitude

We want to leverage the longitude and latitude variables by calculating the distance between every listing and center of Vienna. We take postcode 1010 as the center of Vienna, which we confirmed with a local citizen.

# Calculate the distance, meters as unit
vienna_listings_final <- vienna_listings_final %>% 
  rowwise() %>% 
  mutate(
    dist_from_cent = distm(c(latitude, longitude), c(48.208604427334215, 16.37353507157435), 
                      fun = distVincentyEllipsoid)[1,1]
  ) 

We then compare median distance of each neighbourhood. Use median to avoid outliers.

dist_neighbourhood <- vienna_listings_final %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(
    med_dist = median(dist_from_cent),
    med_price = median(price),
    count_listings = n()
  ) %>% 
  arrange(med_dist)

kable(dist_neighbourhood) %>% 
  kable_styling()
neighbourhood_cleansed med_dist med_price count_listings
Innere Stadt 525.6608 107.0 417
Wieden 1874.1312 61.0 372
Leopoldstadt 2062.9222 60.0 962
Landstra§e 2374.3012 63.0 762
Brigittenau 2695.4915 55.0 356
Alsergrund 2697.6900 60.0 582
Josefstadt 2774.0108 58.0 290
Margareten 2952.5362 53.0 493
Neubau 3068.8588 64.0 545
Mariahilf 3150.7270 65.0 404
Favoriten 3740.6468 56.0 524
W<U+008A>hring 4064.0061 55.0 256
D<U+009A>bling 4669.0753 54.5 176
Hernals 4980.3359 50.0 279
Ottakring 5029.5999 49.0 407
Rudolfsheim-F<U+009F>nfhaus 5056.9721 55.0 656
Meidling 5347.0372 55.0 365
Simmering 6171.0211 55.0 112
Floridsdorf 6537.6258 55.0 156
Penzing 8115.2166 51.0 232
Donaustadt 9265.7734 63.0 327
Hietzing 10790.6303 59.0 110
Liesing 11179.5870 69.0 61

We now plot Neighbourhood median price against median distance from city center

dist_neighbourhood %>% 
  ggplot(aes(x=med_dist,y=med_price,size=count_listings))+
  geom_point(color="#1A5276",alpha=0.4)+
  geom_text_repel(aes(label=neighbourhood_cleansed),size=4)+
  scale_size_continuous(range=c(4,20))+
  theme_bw()+
  theme(
        plot.title =element_text(size=16, face='bold',hjust = 0,margin = margin(10,0,10,0)),
        plot.subtitle =element_text(size=16, hjust = 0), 
        axis.text.x = element_text(size=10),
        axis.text.y = element_text(size=12),
        axis.ticks.x = element_line(),
        axis.ticks.y=element_line(),
        axis.title.x = element_text(size=12,face='bold'),
        axis.title.y = element_text(size=12,face='bold'),
        ) +
  labs(title = "Neighbourhood median price against median distance from city center", 
       x="Median distance from city center",y="Median Price")+
  theme_economist_white()

Innere Stadt is the neighbourhood located right at city center (587.45m in average); thus, it exhibits a high average price.

Neighbourhoods located less than 6 km to city center offer listings for price in range 55 to 65, with price decreasing when distance increases on average.

Interestingly, listings located more than 9km away have higher price for 60-70. There appears to be no certain pattern/relationship between distance from the center and price, but we will investigate further later.

# Grouping by neighbourhood_cleansed
neighbourhood_sorted = vienna_listings_final %>% 
 group_by(neighbourhood_cleansed) %>% 
  summarise(mean_price = mean(price)) %>% 
  arrange(-mean_price)

kable(neighbourhood_sorted) %>% 
  kable_styling()
neighbourhood_cleansed mean_price
Innere Stadt 106.51319
Landstra§e 70.88320
Wieden 69.38978
Liesing 69.14754
Mariahilf 68.78960
Neubau 67.91009
Leopoldstadt 67.06757
Donaustadt 65.33333
Alsergrund 65.32646
Hietzing 64.89091
Brigittenau 63.81461
Josefstadt 63.80000
D<U+009A>bling 63.50568
W<U+008A>hring 62.71094
Margareten 62.65314
Favoriten 62.40840
Rudolfsheim-F<U+009F>nfhaus 60.36280
Simmering 59.44643
Floridsdorf 58.80769
Meidling 58.16164
Hernals 57.46237
Penzing 56.77155
Ottakring 56.02211

We would like to divide our neighbourhoods into five different areas, as for the 23 we currently have, estimation error embedded in our final model would be too high out of sample, potentially giving our out-of-sample predictions massive standard errors. We decide to group the areas by similar mean prices. Since Innere Stadt is such an outlier in terms of price, we divide the rest of the areas into four groups (5,6,6,5).

Note that however, we get unicode characters in three of the neighbourhood_cleansed variables. However, we can avoid any hassle and simply call the variable from the table of neighbourhood_cleansed mean prices. We will later assess our groupings to see if they are statistically significant predictors of price.

# First we need to summarize the other values in the Category "Others"
vienna_listings_final <- vienna_listings_final %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Innere Stadt") ~ neighbourhood_cleansed, 
    neighbourhood_cleansed %in% c("Wieden", "Landstra§e", "Mariahilf", "Neubau", "Liesing") ~ "T2",
    neighbourhood_cleansed %in% c("Leopoldstadt", "Alsergrund", "Hietzing", "Donaustadt", "Josefstadt") ~ "T3",
    neighbourhood_cleansed %in% c(neighbourhood_sorted[12,1], "Brigittenau", "Favoriten", neighbourhood_sorted[15,1], "Margareten", "Floridsdorf") ~ "T4",
    neighbourhood_cleansed %in% c(neighbourhood_sorted[18,1], "Meidling", "Simmering", "Penzing", "Hernals", "Ottakring") ~ "T5"
  ))

# Now we need to factorise the new variable
vienna_listings_final <- vienna_listings_final %>% 
  mutate(
     neighbourhood_simplified = fct_relevel(neighbourhood_simplified,
                                 "Innere Stadt",
                                 "T2",
                                 "T3",
                                 "T4",
                                 "T5")) %>%
  select(-neighbourhood_cleansed)
# Do boxplot for each neighbourhood simplified
vienna_listings_final %>% 
  filter(!is.na(neighbourhood_simplified)) %>% 
  ggplot(aes(y=price)) + 
  geom_boxplot(aes(x=fct_reorder(neighbourhood_simplified,price,median,.desc=TRUE)))+
  labs(title = "Median Price by Neighbourhood")+
  theme_economist()+
  theme(
    axis.title.x= element_blank()) # have to put this after we use economist theme or it puts the x axis title back

We can see see the gradual decline in median price from T2 to T5, but, again, Innere Stadt really stands out as the most in demand location for travelers visiting Vienna. As we saw earlier however, the median price for the locations doesn’t appear to be dependent on location in general. We cross-checked the logic of this with our resident Vienna group member, who confirmed that some of the nicest areas which are generlly in quite high demand from tourists are outside the city centre. Since Vienna’s attractions lie in its architecture and general beauty, we can assume that it is probably a more mature tourist audience, who may want to avoid the hustle and bustle of the city, seeing as nightlife isn’t the primary attraction of Vienna.

We can reasonably delete longitude and latitude from dataset now.

vienna_listings_final <- vienna_listings_final %>% 
  select(-c(longitude,latitude))

2.3.5 Target Variable

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four nights. We will create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

vienna_listings_reg <- vienna_listings_final %>% 
  filter(accommodates>=2) %>%
  mutate(price_4_nights = price * 4,
         lnprice_4_nights = log(price_4_nights))
# Visualise the Price
vienna_listings_reg %>% ggplot(aes(price_4_nights)) + 
  geom_density()+
  labs(title = "Price for Four Nights")+
  theme_economist()

# Visualise the Price
vienna_listings_reg %>% ggplot(aes(x=lnprice_4_nights)) + 
  geom_density()+
  labs(title = "Log Price for Four Nights")+
  theme_economist()

We use log(price_4_nights) for the regression model for the reasons mentioned earlier. The positive skew in the price_4_nights variable makes it more difficult to predict extreme values of price, relative to the normalised natural logarithm.

2.3.6 Deal with Missing/Unreliable Values

There some missing values in review_scores_rating, which we are quite confident will be a significant predictor. We decide to remove these rows (the lm regression in R would remove these automatically but we prefer to do it now to have the data cleaner). We also decide that it makes sense to remove very new or “unrated” properties as data on reviewed properties will on average be much more reliable, and we still have a sufficient number of datapoints to formulate a model.

vienna_listings_reg <- vienna_listings_reg %>%
  filter(!is.na(review_scores_rating)) %>% 
  filter(number_of_reviews>=10)

2.4 Split the Dataset into Training and Testing

# Set seed so we will get the same results
set.seed(202110)

# Split our data 75% as train and 25% as test
train_test_split <- initial_split(vienna_listings_reg, prop=0.75)
vienna_train <- training(train_test_split)
vienna_test <- testing(train_test_split)

3 Model Selection and Validation

3.1 Regression Analysis

Model selection is a structured process. We employ a forward regression process, as we test the significance of variables whilst controlling for others in order to avoid including variables with high collinearity. We first assess the correlation matrix of our numerical explanatory variables in order to determine which variables have the strongest relationships with the target variable, and to highlight variables with high collinearity so as to avoid multicollinearity issues amongst the regressors. We then iterate through a number of models in order to determine if the increase in adjusted r squared is worth the increase in exposure to estimation error in a stepwise forward regression processz. When building regression models with a large number of potential independent variables, we must be cautious of the tradeoff between in-sample bias and out of sample variance of predictions. To optimise our model based on this tradeoff, we can compare the adjusted R squared against the out of sample RMSE (from the testing data). We also perform other model diagnostics such as analysis of the residuals, in order to ensure that they are approximately normal and homoskedastic with a mean of zero.

3.1.1 Correlation of Numerical Variables

# Scatterplot/Correlation matrix for numerical variables
vienna_train %>% 
  select(where(is.numeric)) %>% 
  select(-price )%>%
  ggpairs()+
  theme_economist()

This correlogram gives us a better indication of which variables may explain more of the target variable, which we will use to gear our forward regression in the right direction.

We can also see there are many variables which have a strong correlation with each other, eg. beds and accommodates. This means we need to consider multicollinearity in the regression amongst the predictors.

However, some scatterplots cannot support a linear relationship between variables, eg. price and total listings of the host. There could be some correlations conditional on the value of a categorical variable also which we need to be careful of, eg. the correlation between price and bedrooms could be conditional on property type or room type, since whether the listing is entire or a private room for rental might influence the number of bedrooms. For this reason we simply use the correlogram as a limited source of guidance to help us start off in the right direction

3.1.2 Model 1

We fit regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

model1 <- lm(lnprice_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating, data = vienna_train)
par(mfrow=c(2,2)) 
msummary(model1)
                                                  Estimate Std. Error t value
(Intercept)                                      4.377e+00  1.650e-01  26.520
prop_type_simplifiedPrivate room in rental unit -6.240e-01  2.110e-02 -29.574
prop_type_simplifiedEntire apartment             6.026e-02  2.871e-02   2.099
prop_type_simplifiedOther                       -1.726e-01  3.588e-02  -4.812
number_of_reviews                               -1.791e-04  9.057e-05  -1.977
review_scores_rating                             2.559e-01  3.469e-02   7.375
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
prop_type_simplifiedEntire apartment              0.0359 *  
prop_type_simplifiedOther                       1.57e-06 ***
number_of_reviews                                 0.0481 *  
review_scores_rating                            2.09e-13 ***

Residual standard error: 0.3965 on 3105 degrees of freedom
Multiple R-squared:  0.238, Adjusted R-squared:  0.2368 
F-statistic:   194 on 5 and 3105 DF,  p-value: < 2.2e-16
kable(car::vif(model1)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.006153 3 1.001023
number_of_reviews 1.004995 1 1.002494
review_scores_rating 1.003322 1 1.001659
autoplot(model1) +
  theme_minimal() + 
  labs (title = "Model 1 Diagnostic Plots")

Metrics:

  • The p-value of model1 is smaller than 0.05, which means it is a significant regression model that produces a significant relationship to predict the log of the cost for 2 people and 4 nights at Vienna. The adjusted R-squared is 0.2368 which indicates that this model has limited explanatory power.
  • The residual standard error is quite high also at c. 0.4
  • We later look at the AIC and out-of-sample RMSE of the models, seeing as these are generally used for comparison
  • We use the vif() function to check for multicollinearity between the regressors. We get low VIF values here so it shouldn’t be an issue amongst these variables going forward

The Coefficients:

  • Our intercept here is 4.377. When we convert this to price it gives us a value of almost 80, which also shows that we must be omitting a large number of determinants of price
  • The review_scores_rating coefficient is significant, indicating that the rating scores of reviews on a listing have a significant impact on costs. To be specific, with other variables same, if the review score increase by 1, we expect the cost for 2 people and 4 nights at Vienna increases by approximately 2.56% (property of log-linear regression)
  • The coefficient of number_of_reviews is significant at the 5% level of significance, but is very tiny
  • The coefficient of prop_type_simplified can be interpreted as dummy variables. The baseline category is Entire rental unit. Thus, the other three dummy variables’ represent the deviation in mean log of price of four night stay of the three other property types
  • The coefficient for Private room in rental unit is significant, indicating that the cost for a Private room in rental unit is significantly lower than cost for an Entire rental unit. This makes sense that the coefficient is negative because private rentals are meant to share kitchens and bathrooms with other people, thus leading to lower costs. To be specific, with other variables same, a Private room in rental unit costs approx. 62.4% less than an Entire rental unit (properties of log-linear regression)
  • Coefficient for Entire apartment is significant, indicating that the cost for a Entire apartment is significantly higher than cost for an Entire rental unit. This makes sense that the coefficient is positive because Entire rental unit in this dataset is not specified and maybe worse than a Entire apartment. To be specific, with other variables same, an Entire apartment costs 6.02% more than an Entire rental unit.
  • Coefficient for Other properties is significant, indicating that cost for other properties is significantly lower than an Entire rental unit. This is probably because this category includes a mix of properties like share rooms. To be specific, with other variables same, Other properties costs 17.91% less than an Entire rental unit

3.1.3 Model 2

We now want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.

model2 <- lm(lnprice_4_nights ~ prop_type_simplified+room_type+number_of_reviews+review_scores_rating, data = vienna_train)
msummary(model2)
                                                  Estimate Std. Error t value
(Intercept)                                      4.408e+00  1.637e-01  26.926
prop_type_simplifiedPrivate room in rental unit -8.108e-02  7.767e-02  -1.044
prop_type_simplifiedEntire apartment             6.050e-02  2.838e-02   2.132
prop_type_simplifiedOther                        1.637e-01  5.778e-02   2.833
room_typeEntire home/apt                        -2.654e-02  1.502e-01  -0.177
room_typePrivate room                           -5.428e-01  7.482e-02  -7.255
room_typeShared room                            -9.182e-01  1.504e-01  -6.105
number_of_reviews                               -1.744e-04  8.973e-05  -1.943
review_scores_rating                             2.493e-01  3.441e-02   7.246
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  0.29660    
prop_type_simplifiedEntire apartment             0.03312 *  
prop_type_simplifiedOther                        0.00464 ** 
room_typeEntire home/apt                         0.85975    
room_typePrivate room                           5.04e-13 ***
room_typeShared room                            1.16e-09 ***
number_of_reviews                                0.05206 .  
review_scores_rating                            5.40e-13 ***

Residual standard error: 0.392 on 3102 degrees of freedom
Multiple R-squared:  0.2561,    Adjusted R-squared:  0.2542 
F-statistic: 133.5 on 8 and 3102 DF,  p-value: < 2.2e-16
kable(car::vif(model2)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 16.933444 3 1.602474
room_type 17.052748 3 1.604350
number_of_reviews 1.009458 1 1.004718
review_scores_rating 1.010395 1 1.005184
autoplot(model2) +
  theme_minimal() + 
  labs (title = "Model 2 Diagnostic Plots")

  • The adjusted R squared is 0.2542, which is a marginal improvement on the previous model at the cost of three extra sources of estimation error, which will increase the variance of our predicted means

  • We also get very high variance inflation factors for property type and room type, understandably. This is also indicated by the deviation of the previously significant property coefficient for private room in rental unit to being insignificant. Given that the room type variable only very marginally improved the model’s explanatory value, and introduces a lot of multicollinearity amongst the regressors, we decide to drop either the room_type or property_type variable. When we tested the above model with either room_type or prop_type_simplified, the room_type regression produced a higher standard error. We also prefer the room_type categories as they are more applicable to how a user would appraise a property

3.1.4 Model 3

We want to test if the number of bathrooms, bedrooms, beds, or size of the house (accomodates) are significant predictors of price_4_nights. However, we suspect a high degree of multicollinearity betwen these variables, so we make a correlation matrix between these variables to determine if there is a high corelation between the predictors, and which one demonstrates the strongest relationship with the target variable.

vienna_train %>% 
  select(lnprice_4_nights, bathrooms, bedrooms, accommodates, beds) %>% 
  ggpairs()+
  labs(title = "Correlation Matrix") 

  • Excluding bathrooms, there is a high degree of correlation between the predictors, and accommodates demonstrates the strongest relationship with the target variable. This makes sense, seeing as accommodates is determined by the number of beds and bedrooms in a property. Thus, we decide to only add in the Accommodates variable into our model.

  • Bathrooms demonstrates a relatively weak correlation coefficient with accommodates, but logically you would expect an accommodation with more bathrooms to cost more per person, when we control for the accommodates factor. Price is probably more sensitive to this variable when the number for accommodates is very high however.

model3 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating
             +accommodates+bathrooms, data = vienna_train)
msummary(model3)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.706e+00  1.524e-01  24.319  < 2e-16 ***
room_typeEntire home/apt  2.978e-01  1.275e-01   2.336  0.01954 *  
room_typePrivate room    -4.790e-01  1.911e-02 -25.060  < 2e-16 ***
room_typeShared room     -9.443e-01  1.277e-01  -7.397 1.78e-13 ***
number_of_reviews        -2.451e-04  8.218e-05  -2.982  0.00288 ** 
review_scores_rating      2.916e-01  3.164e-02   9.215  < 2e-16 ***
accommodates              9.319e-02  4.347e-03  21.438  < 2e-16 ***
bathrooms                 1.397e-01  2.195e-02   6.366 2.23e-10 ***

Residual standard error: 0.3588 on 3092 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.3757,    Adjusted R-squared:  0.3743 
F-statistic: 265.9 on 7 and 3092 DF,  p-value: < 2.2e-16
kable(car::vif(model3)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_type 1.162031 3 1.025344
number_of_reviews 1.008566 1 1.004274
review_scores_rating 1.015483 1 1.007712
accommodates 1.226595 1 1.107517
bathrooms 1.110464 1 1.053786
autoplot(model3) +
  theme_minimal() + 
  labs (title = "Model 3 Diagnostic Plots")

  • Adjusted R-squared is 0.3743, greatly improved by factoring in accommodates and bathrooms, both of which are statistically significant (\(\alpha\) =5%).

  • Note the low multicollinearity amongst the regressors after the removal of the property type variable.

3.1.5 Model 4

We now want to test if superhosts command a pricing premium, after controlling for other variables.

model4 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating
             +accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model4)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.040e+00  1.641e-01  24.621  < 2e-16 ***
room_typeEntire home/apt  3.232e-01  1.270e-01   2.546 0.010955 *  
room_typePrivate room    -4.733e-01  1.905e-02 -24.844  < 2e-16 ***
room_typeShared room     -9.274e-01  1.271e-01  -7.296 3.76e-13 ***
number_of_reviews        -3.200e-04  8.297e-05  -3.857 0.000117 ***
review_scores_rating      2.184e-01  3.436e-02   6.355 2.39e-10 ***
accommodates              9.256e-02  4.332e-03  21.369  < 2e-16 ***
bathrooms                 1.305e-01  2.191e-02   5.956 2.87e-09 ***
host_is_superhostTRUE     7.652e-02  1.456e-02   5.256 1.58e-07 ***

Residual standard error: 0.3571 on 3089 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared:  0.3816,    Adjusted R-squared:   0.38 
F-statistic: 238.2 on 8 and 3089 DF,  p-value: < 2.2e-16
kable(car::vif(model4)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_type 1.167476 3 1.026143
number_of_reviews 1.037542 1 1.018598
review_scores_rating 1.208526 1 1.099330
accommodates 1.228944 1 1.108577
bathrooms 1.116155 1 1.056482
host_is_superhost 1.246227 1 1.116345
autoplot(model4) +
  theme_minimal() + 
  labs (title = "Model 4 Diagnostic Plots")

  • Adjusted R-squared is 0.38, a barely improved by including the Boolean host_is_superhost variable. However, this regressor is statistically significant, and positive as expected. We decide to continue with the variable as it is extremely significant, and wait until we control for some additional regressors.

  • Note that the lm() function automatically discards rows with missing values for the explanatory variables. We were reluctant to remove some of these rows earlier, as we did not know if some of these variables would be statistically significant predictors of the target variable,; and thus, we may have discarded valuable data unnecessarily if they were not useful.

3.1.6 Model 5

We now want to test if the offering by some hosts allowing you to immediately book their listing is a significant predictor of the taget variable after controlling for other variables.

model5 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost+instant_bookable, data = vienna_train)
msummary(model5)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.089e+00  1.676e-01  24.394  < 2e-16 ***
room_typeEntire home/apt  3.211e-01  1.269e-01   2.530 0.011462 *  
room_typePrivate room    -4.759e-01  1.914e-02 -24.871  < 2e-16 ***
room_typeShared room     -9.369e-01  1.273e-01  -7.361 2.32e-13 ***
number_of_reviews        -3.203e-04  8.296e-05  -3.861 0.000115 ***
review_scores_rating      2.102e-01  3.483e-02   6.034 1.79e-09 ***
accommodates              9.300e-02  4.342e-03  21.420  < 2e-16 ***
bathrooms                 1.299e-01  2.191e-02   5.930 3.37e-09 ***
host_is_superhostTRUE     7.790e-02  1.459e-02   5.340 9.99e-08 ***
instant_bookableTRUE     -1.909e-02  1.336e-02  -1.429 0.153214    

Residual standard error: 0.3571 on 3088 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared:  0.382, Adjusted R-squared:  0.3802 
F-statistic: 212.1 on 9 and 3088 DF,  p-value: < 2.2e-16
kable(car::vif(model5)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_type 1.181293 3 1.028157
number_of_reviews 1.037548 1 1.018601
review_scores_rating 1.242236 1 1.114556
accommodates 1.234951 1 1.111283
bathrooms 1.116530 1 1.056660
host_is_superhost 1.251743 1 1.118813
instant_bookable 1.053646 1 1.026472
autoplot(model5) +
  theme_minimal() + 
  labs (title = "Model 5 Diagnostic Plots")

  • The adjusted R squared is unchanged from Model 4, and we have an insignificant predictor in instant_bookable, so we discard this variable going forward.

3.1.7 Model 6

We now test the predictive ability of the neighbourhood_simplified variable which we created earlier.

model6 <- lm(lnprice_4_nights ~ room_type+neighbourhood_simplified+ number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model6)
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.120e+00  1.610e-01  25.598  < 2e-16 ***
room_typeEntire home/apt              7.512e-02  1.182e-01   0.636  0.52497    
room_typePrivate room                -4.612e-01  1.867e-02 -24.708  < 2e-16 ***
room_typeShared room                 -4.826e-01  1.654e-01  -2.918  0.00355 ** 
neighbourhood_simplifiedT5           -1.689e-01  1.916e-02  -8.818  < 2e-16 ***
neighbourhood_simplifiedT4           -1.500e-01  1.922e-02  -7.804 8.51e-15 ***
neighbourhood_simplifiedT2            4.415e-02  1.662e-02   2.656  0.00795 ** 
neighbourhood_simplifiedInnere Stadt  3.834e-01  2.755e-02  13.919  < 2e-16 ***
number_of_reviews                    -4.326e-04  8.167e-05  -5.297 1.27e-07 ***
review_scores_rating                  2.081e-01  3.372e-02   6.173 7.72e-10 ***
accommodates                          9.417e-02  4.330e-03  21.749  < 2e-16 ***
bathrooms                             1.304e-01  2.137e-02   6.103 1.19e-09 ***
host_is_superhostTRUE                 7.957e-02  1.436e-02   5.541 3.30e-08 ***

Residual standard error: 0.3298 on 2723 degrees of freedom
  (375 observations deleted due to missingness)
Multiple R-squared:  0.4799,    Adjusted R-squared:  0.4777 
F-statistic: 209.4 on 12 and 2723 DF,  p-value: < 2.2e-16
kable(car::vif(model6)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_type 1.183425 3 1.028466
neighbourhood_simplified 1.057168 4 1.006973
number_of_reviews 1.052265 1 1.025800
review_scores_rating 1.207981 1 1.099082
accommodates 1.237923 1 1.112620
bathrooms 1.127118 1 1.061658
host_is_superhost 1.251711 1 1.118799
autoplot(model6) +
  theme_minimal() + 
  labs (title = "Model 6 Diagnostic Plots")+
  theme(text = element_text(size=6))

  • The introduction of the neighbourhood_simplified variable improves the adjusted R squared greatly to 0.48. Each bucket we computed is a statistically significant dummy variable at the 5% level of significance. However, we lose 362 additional observations. We decide that the increase in explanatory power is worth the loss in data as we still have a sufficient degrees of freedom to avoid over-fitting.

  • The inclusion of the neighbourhood_simplified variable however results in the “Entire home/apt” category of the room_type variable to become statistically insignificant i.e. the mean of the target variable is not statistically significantly different for an entire home/apartment than the baseline room_type in the model (Hotel room) when controlling for other variables. So we should group this category into the baseline category based on this. There must exist a relationship between the neighbourhoods and the room_type variable, which is understandable as more central areas will have more hotels and apartments on average when compared with areas outside/on the outskirts of the city.

# First we need to summarize the specified room_type category into the baseline for training and testing dataset, then we re-factorise them

vienna_train <- vienna_train %>%
  mutate(room_types = case_when(
    room_type %in% c("Hotel room", "Entire home/apt") ~ "Entire property", 
    room_type %in% c("Private room") ~ "Private room",
    room_type %in% c("Shared room") ~ "Shared room"
  ))

# Now we need to factorise the new variable
vienna_train <- vienna_train %>% 
  mutate(
     room_types = fct_relevel(room_types,
                                 "Entire property",
                                 "Private room",
                                 "Shared room"))

vienna_test <- vienna_test %>%
  mutate(room_types = case_when(
    room_type %in% c("Hotel room", "Entire home/apt") ~ "Entire property", 
    room_type %in% c("Private room") ~ "Private room",
    room_type %in% c("Shared room") ~ "Shared room"
  ))

# Now we need to factorise the new variable
vienna_test <- vienna_test %>% 
  mutate(
     room_types = fct_relevel(room_types,
                                 "Entire property",
                                 "Private room",
                                 "Shared room"))

Now we fit model 6 again with the new room_type grouping.

model6 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+ number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model6)
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.123e+00  1.609e-01  25.627  < 2e-16 ***
room_typesPrivate room               -4.616e-01  1.865e-02 -24.746  < 2e-16 ***
room_typesShared room                -4.830e-01  1.654e-01  -2.920  0.00352 ** 
neighbourhood_simplifiedT5           -1.690e-01  1.916e-02  -8.824  < 2e-16 ***
neighbourhood_simplifiedT4           -1.498e-01  1.921e-02  -7.799 8.84e-15 ***
neighbourhood_simplifiedT2            4.414e-02  1.662e-02   2.656  0.00796 ** 
neighbourhood_simplifiedInnere Stadt  3.854e-01  2.737e-02  14.080  < 2e-16 ***
number_of_reviews                    -4.303e-04  8.158e-05  -5.274 1.44e-07 ***
review_scores_rating                  2.077e-01  3.371e-02   6.161 8.30e-10 ***
accommodates                          9.399e-02  4.320e-03  21.756  < 2e-16 ***
bathrooms                             1.307e-01  2.137e-02   6.115 1.10e-09 ***
host_is_superhostTRUE                 7.922e-02  1.435e-02   5.521 3.69e-08 ***

Residual standard error: 0.3297 on 2724 degrees of freedom
  (375 observations deleted due to missingness)
Multiple R-squared:  0.4799,    Adjusted R-squared:  0.4778 
F-statistic: 228.5 on 11 and 2724 DF,  p-value: < 2.2e-16
kable(car::vif(model6)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_types 1.156085 2 1.036925
neighbourhood_simplified 1.041761 4 1.005127
number_of_reviews 1.050158 1 1.024772
review_scores_rating 1.207406 1 1.098820
accommodates 1.232761 1 1.110298
bathrooms 1.126792 1 1.061505
host_is_superhost 1.249866 1 1.117974
autoplot(model6) +
  theme_minimal() + 
  labs (title = "Model 6 Diagnostic Plots")+
  theme(text = element_text(size=6))

  • All of our coefficients on the regressors are now statistically significant. The standard error on the Entire home/apt category coefficient was very large, due to a lack of data.

3.1.8 Model 7

We have been using the number of reviews variable, but we are cautious that it may not be the most current reviews variable in the dataset, especially considering its miniscule coefficient. We now want to select which of the review number variables we should use. We look at the correlation matrix between number_of_reviews, reviews_per_month, number_of_reviews_ltm. We also want to test the effect of including the avalability_30 variable on the model’s explanatory power, as it could be a proxy for general demand.

vienna_train %>% 
  select(lnprice_4_nights, number_of_reviews, reviews_per_month, number_of_reviews_ltm) %>% 
  ggpairs()+
  labs(title = "Correlation Matrix")

  • The number of reviews over the last twelve months is a more current variable than the number of reviews, and demonstrates a stronger correlation with the target variable. We would only like to use one of these variables at most due to the high collinearity between them, and because we want to minimise out-of-sample variance of our mean forecasts. However, the explanatory values of these variables are pretty poor. We tested Model 6 with the number_of_reviews_ltm variable, but it only increased the adjusted R squared by less than 0.01. When we removed the number_of_reviews variable also, the adjusted coefficient of determination barely changed, so we discard this variable on the basis that the additional explanatory value is not worth the increase in model variance and estimation error. It is surprising that the number of reviews has approximately no relationship with the target variable.

We now fit the model with availability_30 included and number of reviews removed.

model7 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+review_scores_rating+accommodates+bathrooms+host_is_superhost+availability_30, data = vienna_train)
msummary(model7)
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           3.6447551  0.1567652  23.250  < 2e-16 ***
room_typesPrivate room               -0.4398531  0.0179669 -24.481  < 2e-16 ***
room_typesShared room                -0.5690356  0.1592463  -3.573 0.000359 ***
neighbourhood_simplifiedT5           -0.1733603  0.0184080  -9.418  < 2e-16 ***
neighbourhood_simplifiedT4           -0.1409184  0.0184791  -7.626 3.33e-14 ***
neighbourhood_simplifiedT2            0.0496294  0.0159820   3.105 0.001920 ** 
neighbourhood_simplifiedInnere Stadt  0.3836262  0.0263297  14.570  < 2e-16 ***
review_scores_rating                  0.2869220  0.0327441   8.763  < 2e-16 ***
accommodates                          0.0880354  0.0041754  21.084  < 2e-16 ***
bathrooms                             0.1374545  0.0205616   6.685 2.79e-11 ***
host_is_superhostTRUE                 0.0577977  0.0136433   4.236 2.35e-05 ***
availability_30                       0.0100769  0.0006399  15.748  < 2e-16 ***

Residual standard error: 0.3173 on 2724 degrees of freedom
  (375 observations deleted due to missingness)
Multiple R-squared:  0.5184,    Adjusted R-squared:  0.5165 
F-statistic: 266.6 on 11 and 2724 DF,  p-value: < 2.2e-16
kable(car::vif(model7)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_types 1.159925 2 1.037785
neighbourhood_simplified 1.031108 4 1.003837
review_scores_rating 1.230617 1 1.109332
accommodates 1.243520 1 1.115132
bathrooms 1.127023 1 1.061613
host_is_superhost 1.220213 1 1.104633
availability_30 1.050895 1 1.025131
autoplot(model7) +
  theme_minimal() + 
  labs(title = "Model 7 Diagnostic Plots")+
  theme(text = element_text(size=6))

  • Availability_30 is statistically significant at the 5% level of significance. It also increases our model’s adjusted R squared by over 0.04 to 0.52, which is a welcome boost. Multicollinearity is relatively low in this model as well which gives us confidence in the approximate independence of our explanatory variables. If availability_30 increases by 1, we would expect, based on the model, that price for two people for four nights would increase by c. 1% on average. This variable is statistically significant at the 5% level of significance.

  • After running the specified regressions, we must look at the remaining variables we have not used so far to potentially optimise our models explanatory value further.

colnames(vienna_train %>% 
  select(-c(lnprice_4_nights,prop_type_simplified,neighbourhood_simplified,review_scores_rating,accommodates,host_is_superhost,availability_30,number_of_reviews, reviews_per_month, number_of_reviews_ltm,instant_bookable,bathrooms, bedrooms, beds,room_type, minimum_nights, maximum_nights, lnprice, dist_from_cent, price_4_nights, availability_365,price,room_types)))
[1] "host_response_time"        "host_response_rate"       
[3] "host_acceptance_rate"      "host_total_listings_count"
[5] "host_identity_verified"   

Relevant variables not tested yet:

  • host_identity_verified
  • host_total_listings_count (not logically deterministic of price)
  • host_acceptance_rate (too many missing entries)
  • host_response_rate (too many missing entries)
  • host_response_time (too many missing entries)

3.1.9 Model 8

We decide to test the host_identity_verified Boolean variable, as we can logically hypothesise that people could potentially pay a premium for the security of the booking. However, the host may not demand a premium for being verified, since it is strictly the seller which determines market price here.

model8 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+review_scores_rating+accommodates+bathrooms+host_is_superhost+availability_30+host_identity_verified, data = vienna_train)
msummary(model8)
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           3.6343325  0.1573285  23.100  < 2e-16 ***
room_typesPrivate room               -0.4385001  0.0180494 -24.294  < 2e-16 ***
room_typesShared room                -0.5647267  0.1593503  -3.544 0.000401 ***
neighbourhood_simplifiedT5           -0.1724669  0.0184439  -9.351  < 2e-16 ***
neighbourhood_simplifiedT4           -0.1397172  0.0185426  -7.535 6.61e-14 ***
neighbourhood_simplifiedT2            0.0502802  0.0160042   3.142 0.001698 ** 
neighbourhood_simplifiedInnere Stadt  0.3836471  0.0263315  14.570  < 2e-16 ***
review_scores_rating                  0.2871503  0.0327477   8.769  < 2e-16 ***
accommodates                          0.0878500  0.0041823  21.005  < 2e-16 ***
bathrooms                             0.1365153  0.0205973   6.628 4.09e-11 ***
host_is_superhostTRUE                 0.0566953  0.0137152   4.134 3.68e-05 ***
availability_30                       0.0100626  0.0006402  15.718  < 2e-16 ***
host_identity_verifiedTRUE            0.0131092  0.0165703   0.791 0.428940    

Residual standard error: 0.3173 on 2723 degrees of freedom
  (375 observations deleted due to missingness)
Multiple R-squared:  0.5185,    Adjusted R-squared:  0.5164 
F-statistic: 244.4 on 12 and 2723 DF,  p-value: < 2.2e-16
kable(car::vif(model8)) %>% 
  kable_styling()
GVIF Df GVIF^(1/(2*Df))
room_types 1.171649 2 1.040398
neighbourhood_simplified 1.040134 4 1.004931
review_scores_rating 1.230712 1 1.109375
accommodates 1.247435 1 1.116886
bathrooms 1.130779 1 1.063381
host_is_superhost 1.232940 1 1.110378
availability_30 1.051727 1 1.025537
host_identity_verified 1.050888 1 1.025128
autoplot(model8) +
  theme_minimal() + 
  labs (title = "Model 8 Diagnostic Plots")+
  theme(text = element_text(size=6))

  • The host identity verification variable is not statistically significant at the 5% level of significance, so we discard it.

  • We have now exhausted our supply of explanatory variables. We can say so far that Model 7 is the best in terms of adjusted R squared so far, but we will compare the models further below using AIC, residual standard error, and out-of-sample RMSE after. We will then perform diagnostics tests on the residuals, as if the distribution of the residuals violates our OLS assumptions, we must re-select the models, or perform heteroskedasticity and autocorrelation robust regressions.

models_compare <- huxreg(model1, model2, model3, model4, model5, model6, model7,model8,
                 statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'AIC' = 'AIC'), 
                 bold_signif = 0.05, 
                 stars = NULL
) %>% 
  set_caption('Comparison of models')
models_compare
Comparison of models
(1)(2)(3)(4)(5)(6)(7)(8)
(Intercept)4.377 4.408 3.706 4.040 4.089 4.123 3.645 3.634 
(0.165)(0.164)(0.152)(0.164)(0.168)(0.161)(0.157)(0.157)
prop_type_simplifiedPrivate room in rental unit-0.624 -0.081                               
(0.021)(0.078)                              
prop_type_simplifiedEntire apartment0.060 0.061                               
(0.029)(0.028)                              
prop_type_simplifiedOther-0.173 0.164                               
(0.036)(0.058)                              
number_of_reviews-0.000 -0.000 -0.000 -0.000 -0.000 -0.000           
(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)          
review_scores_rating0.256 0.249 0.292 0.218 0.210 0.208 0.287 0.287 
(0.035)(0.034)(0.032)(0.034)(0.035)(0.034)(0.033)(0.033)
room_typeEntire home/apt     -0.027 0.298 0.323 0.321                
     (0.150)(0.127)(0.127)(0.127)               
room_typePrivate room     -0.543 -0.479 -0.473 -0.476                
     (0.075)(0.019)(0.019)(0.019)               
room_typeShared room     -0.918 -0.944 -0.927 -0.937                
     (0.150)(0.128)(0.127)(0.127)               
accommodates          0.093 0.093 0.093 0.094 0.088 0.088 
          (0.004)(0.004)(0.004)(0.004)(0.004)(0.004)
bathrooms          0.140 0.130 0.130 0.131 0.137 0.137 
          (0.022)(0.022)(0.022)(0.021)(0.021)(0.021)
host_is_superhostTRUE               0.077 0.078 0.079 0.058 0.057 
               (0.015)(0.015)(0.014)(0.014)(0.014)
instant_bookableTRUE                    -0.019                
                    (0.013)               
room_typesPrivate room                         -0.462 -0.440 -0.439 
                         (0.019)(0.018)(0.018)
room_typesShared room                         -0.483 -0.569 -0.565 
                         (0.165)(0.159)(0.159)
neighbourhood_simplifiedT5                         -0.169 -0.173 -0.172 
                         (0.019)(0.018)(0.018)
neighbourhood_simplifiedT4                         -0.150 -0.141 -0.140 
                         (0.019)(0.018)(0.019)
neighbourhood_simplifiedT2                         0.044 0.050 0.050 
                         (0.017)(0.016)(0.016)
neighbourhood_simplifiedInnere Stadt                         0.385 0.384 0.384 
                         (0.027)(0.026)(0.026)
availability_30                              0.010 0.010 
                              (0.001)(0.001)
host_identity_verifiedTRUE                                   0.013 
                                   (0.017)
#observations3111     3111     3100     3098     3098     2736     2736     2736     
R squared0.238 0.256 0.376 0.382 0.382 0.480 0.518 0.519 
Adj. R Squared0.237 0.254 0.374 0.380 0.380 0.478 0.516 0.516 
Residual SE0.397 0.392 0.359 0.357 0.357 0.330 0.317 0.317 
AIC3081.534 3012.724 2452.743 2423.248 2423.201 1707.104 1496.520 1497.892 
# Get list of adjusted R squared values
ar2_comparison = data.frame(summary(model1)$adj.r.squared,summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared,summary(model5)$adj.r.squared,summary(model6)$adj.r.squared,summary(model7)$adj.r.squared,summary(model8)$adj.r.squared)
# Get out of sample RMSE
RMSE1 = 100*sd((predict(model1, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE2 = 100*sd((predict(model2, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE3 = 100*sd((predict(model3, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE4 = 100*sd((predict(model4, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE5 = 100*sd((predict(model5, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE6 = 100*sd((predict(model6, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE7 = 100*sd((predict(model7, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE8 = 100*sd((predict(model8, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)

RMSE_compare = data.frame(RMSE1, RMSE2, RMSE3, RMSE4, RMSE5, RMSE6, RMSE7, RMSE8)

kable(RMSE_compare)%>% 
  kable_styling()
RMSE1 RMSE2 RMSE3 RMSE4 RMSE5 RMSE6 RMSE7 RMSE8
39.79635 39.06386 36.74797 36.74612 36.70792 33.97237 32.45942 32.45975

Adjusted R-Squared

  • Model 7 demonstrates the highest coefficient of determination

Residual Standard Error

  • Model 7 demonstrates the lowest residual standard error amongst the models, which should give us narrower prediction intervals as desired

AIC

  • Model 7 demonstrates the lowest AIC amongst the models. The akaike information criterion (AIC) is a very useful metric when determining if the addition of additional explanatory variables in a multivariate model justifies the addition of another estimated parameter. It includes a loss function which penalises the model based on the number of coefficients/parameters estimated

Out-of-sample RMSE

  • Model 7 demonstrates the lowest out-of-sample RMSE

Residuals

  • Model 7’s residuals appear to be randomly scattered around its mean of 0, and look approximately normal as they appear to simulate a white noise distribution with no clear pattern as the target variable varies. The QQ plot is approximately along the 45 degree line also, with most residual quantiles well-fitted (note some deviation for extreme values however). It does not appear that we require heteroskedasticity robust standard errors.

  • Therefore, we decide to select model 7 as our optimal model for the price for two people to stay for four nights in Vienna. On average, this model should give us the most accurate predictions for average property rental price for the given input variables with the lowest standard errors amongst the fitted models.

4 Findings and Recommendations

4.1 Model Interpretation

Finally, we use our optimised model for predictions. The model reads:

\(Y = 3.64 - 0.44(Private\ Room) - 0.57(Shared\ Room) - 0.17(T5) - 0.13(T4) + 0.05(T2) + 0.38(Innere\ Stadt) + 0.29(Rating) + 0.09(Accommodates) + 0.14(Bathrooms) + 0.06(Superhost) + 0.01(Availability\_30) + \epsilon_i\)

  • where Y = ln(price_4_nights)

Interpretations of Coefficients (when controling for other variables)

  • Private Room: the negative coefficient shows that you should expect to pay c. 44% less for a private room in a shared accommodation as opposed to a hotel room for four nights (which is the baseline)

  • Shared Room: you should expect to pay c. 57% less for a private room in a shared room as opposed to a hotel room (which is the baseline) for four nights

Logic: travelers pay a premium for privacy

  • T5: you should expect to pay c. 17% less to stay in a T5 neighbourhood (bottom 5 nighbourhoods by mean price) for four nights than a T3 neighbourhood (baseline)

  • T4: you should expect to pay c. 13% less to stay in a T4 neighbourhood for four nights than a T3 neighbourhood (baseline)

  • T2: you should expect to pay c. 5% more to stay in a T2 neighbourhood for four nights than a T3 neighbourhood (baseline)

  • Innere Stadt: you should expect to pay c. 38% more to stay in the Innere Stadt neighbourhood for four nights than a T3 neighbourhood (baseline)

Logic: location is one of the primary considerations for travelers staying in Vienna

  • Rating: you should expect to pay c. 29% more on average to stay in a property with a rating of one point higher than another for four nights, all else equal

  • Accommodates: you should expect to pay c. 9% more on average to stay in a property which can accommodate one more person than another for four nights, all else equal

  • Bathrooms: you should expect to pay c. 14% more on average to stay in a property with one more bathroom than another for four nights, all else equal

  • Superhost: you should expect to pay c. 6% more to stay in a property listed by a superhost on average, all else equal

  • Availability_30: you should expect to pay c. 1% more on average to stay in a property which is available for one more day than another property in the next thirty days, all else equal

Logic: overpriced accommodation is likely to be a deterrent of demand

4.2 Prediction

We first predict the price for a private room, with at least 10 reviews, and an average rating of at least 90 (4.5/5.0). We also assume that the property accommodates 2 people, has one bathroom, and has the average availability_30 in the entire cleaned dataset. We predict for each of the five neighbourhood buckets from most to least expensive We use the model to predict the total cost to stay at this Airbnb for 4 nights.

imaginary_stay1 <- data_frame(room_types="Private room",
                              neighbourhood_simplified = c("Innere Stadt","T2", "T3", "T4", "T5"), # vary at will
                              review_scores_rating = 4.5:5,
                              accommodates=2,
                              bathrooms=1,
                              host_is_superhost=TRUE,
                              availability_30=mean(vienna_listings_reg$availability_30))

pred1=data.frame(exp(predict.lm(model7, imaginary_stay1,  interval = "confidence")), row.names = c("Innere Stadt","T2", "T3", "T4", "T5"))
kable(pred1)%>% 
  kable_styling()
fit lwr upr
Innere Stadt 208.2214 195.3977 221.8867
T2 149.0982 142.3217 156.1974
T3 141.8792 135.4586 148.6040
T4 123.2306 117.2191 129.5503
T5 119.2969 113.4989 125.3911

The table above gives the 95% confidence interval and expectation of the price for two people to stay for four nights with the other criteria mentioned for each of the simplified neighbourhood buckets. This demonstrates the significant difference between the neighbourhoods, with Innere Stadt the obvious positive outlier in terms of price. This table gives the user the location/price trade-off of staying in a nicer or potentially better located area, with the given assumptions above.

#plotting graph using geom_errorbar
difference_pred1<- pred1 %>% 
  ggplot(aes(color=row.names(pred1))) +
  geom_errorbar(aes(y=row.names(pred1),xmin=lwr,xmax= upr),width=0.1,size=1.5,show.legend = FALSE) +
  geom_point(aes(x=fit,y=row.names(pred1)),size=5,show.legend = FALSE)+
  geom_text(aes(label=round(lwr,digits=2),x=round(lwr,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.5,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
  geom_text(aes(label=round(upr,digits=2),x=round(upr,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.5,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
  geom_text(aes(label=round(fit,digits=2),x=round(fit,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.4,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
  theme(legend.position="none",legend.title = element_blank())+
  labs(
    title = "Prediction by Grouped Neighbourhood - Vienna",
    subtitle = "95% Confidence Intervals",
    x = "Prediction Price for Two People for Four Nights",
    y = "Neighbourhoods"
    )+
  theme_economist()+
  NULL

difference_pred1

Note that the 95% confidence intervals for the predictions for T5 and T4 overlap, as do T3 and T2. However, they are statistically significant predictors so we are not alarmed.

4.3 Potential Improvements

Though we are quite happy with our results and the explanatory power of the final model, we must note potential improvements to the process/model:

4.3.1 Obtain cleaner data

The data obtained was flawed (though not fatally flawed!) in numerous ways. The biggest issue with the data was the volume of missing values for potential explanatory variables, which can increase the standard errors of the parameter estimates. Ideally, we would like to maximise the amount of datapoints when fitting our model, but unfortunately this was negated by the inclusion of certain independent variables. The amount of transformations required and outliers in the data also can increase our margin of error in fitting and interpreting the data.

4.3.2 Subjectivity

The subjectivity involved in truncating the tails of the data can take away from the scientific method employed. Other sources of subjectivity/bias include the grouping of categories when simplifying variables and omitting properties with fewer than ten reviews. This non-exact, though logic-based, decision-making can lead to bias in the model.

4.3.3 Overfitting and Estimation Error

The inclusion of twelve coefficient estimates (including the intercept) leaves us prone to over-fitting errors which can increase out-of-sample variance in the residuals of the model. It also leaves us prone to estimation error which can void a model’s usefulness in real life. However, we exercised caution consistently throughout the model selection process when analysing the tradeoff between bias and variance when including additional regressors.

5 Acknowledgements